Hierarchical agglomerative clustering is a “bottom-up” clustering method: each observation begins as its own cluster, and the most similar clusters are merged step by step as we move up the hierarchy. In dendrogram terms, every leaf starts as its own cluster, and clusters fuse as we move up through the branches toward the trunk.
Distance and cluster-method information are usually displayed at the bottom of the graph, while the vertical axis shows the height, i.e., the distance between two clusters; distances along the horizontal axis carry little meaning. We can also “cut” the dendrogram to obtain a chosen number of clusters, much like setting k in k-means clustering (a choice that is just as problematic here).
In a real-life research situation, you will likely want to scale the data first, since variables measured on larger scales will otherwise dominate the distance calculations; raw data are used in this example for simplicity.
# Package installation
if (FALSE) {
# Run this block manually (once) to install the necessary packages.
# Install packages from CRAN:
install.packages(c("ape", "pvclust", "mclust"))
}
# fancy dendrogram options
library(ape)
# dendrograms with p-values
library(pvclust)
# model-based clustering
library(mclust)
## Package 'mclust' version 5.4.1
## Type 'citation("mclust")' for citing this R package in publications.
data(mtcars)
?mtcars
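In case the scaling step mentioned above is unfamiliar, here is a minimal sketch using base R's scale(), which standardizes each column to mean 0 and unit variance (the rest of this walkthrough continues with the raw data):

```r
# Standardize every column so no single variable (e.g., disp, in the
# hundreds) dominates the Euclidean distances
mtcars_scaled = scale(mtcars)

# Column means are now ~0 and standard deviations are 1
round(colMeans(mtcars_scaled), 10)
apply(mtcars_scaled, 2, sd)
```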
hclust built-in function
Start with the built-in hclust function from {stats}. hclust expects a dissimilarity matrix (produced here with the dist function), so it clusters the rows of mtcars, whereas the pvclust approach further below clusters the columns.
# See the help files
?hclust
# Create distance matrix
mtcars_dist = dist(mtcars, method = "euclidean")
# Fit hclust_model
system.time({
hclust_model = hclust(mtcars_dist, method = "complete")
})
## user system elapsed
## 0.000 0.000 0.001
# Plot hclust_model dendrogram
plot(hclust_model, hang = -1)
Data are visualized in dendrograms, or branching tree-like structures similar to decision trees, albeit with less information displayed at each node. The most similar items are found lower in the dendrogram and fuse into \(n-1\) clusters as we move up the tree; the next two items to fuse into a cluster produce \(n-2\) clusters and so on as we move up the tree until there is just one overarching cluster. Thus, clusters become more inclusive as we move up the hierarchy.
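These \(n-1\) fusions are stored directly in the fitted object: hclust_model$merge records which observations (negative entries) or earlier merge steps (positive entries) fuse at each step, and hclust_model$height records the distance at which each fusion occurs, which is non-decreasing for complete linkage:

```r
# mtcars has 32 rows, so there are 31 fusions in total
length(hclust_model$height)         # 31
# Heights never decrease as we move up the tree (complete linkage is monotone)
all(diff(hclust_model$height) >= 0) # TRUE
# First few merges: negative entries are original observations,
# positive entries refer to clusters formed at earlier steps
head(hclust_model$merge)
```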
Dissimilarity is applied not just to single observations but to groups as well; how group-to-group distance is defined is the linkage criterion. Thus the “Cadillac Fleetwood / Lincoln Continental” cluster fuses with “Chrysler Imperial” rather than with “Maserati Bora” or some other car.
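To see how much the linkage choice matters, one quick check (a sketch, reusing the distance matrix and model fitted above) is to refit with single linkage and cross-tabulate the two five-cluster solutions:

```r
# Refit with single linkage (smallest intercluster distance)
hclust_single = hclust(mtcars_dist, method = "single")
# Cross-tabulate the 5-cluster solutions from the two linkages;
# off-diagonal counts show where the linkages disagree
table(complete = cutree(hclust_model, 5),
      single   = cutree(hclust_single, 5))
```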
You can also cut the tree to see how the tree varies:
# If we want only 5 clusters, for example (k must be a number between 1 and 32, since mtcars has only 32 observations):
cutree(hclust_model, 5)
## Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive Hornet Sportabout Valiant Duster 360
## 1 1 1 2 3 2 3
## Merc 240D Merc 230 Merc 280 Merc 280C Merc 450SE Merc 450SL Merc 450SLC
## 1 1 1 1 2 2 2
## Cadillac Fleetwood Lincoln Continental Chrysler Imperial Fiat 128 Honda Civic Toyota Corolla Toyota Corona
## 3 3 3 4 4 4 1
## Dodge Challenger AMC Javelin Camaro Z28 Pontiac Firebird Fiat X1-9 Porsche 914-2 Lotus Europa
## 2 2 3 3 4 1 1
## Ford Pantera L Ferrari Dino Maserati Bora Volvo 142E
## 3 1 5 1
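cutree() can also cut at a height on the vertical axis instead of asking for a fixed k; here h = 150 is an arbitrary height chosen purely for illustration:

```r
# Cut the tree at a height of 150 on the dendrogram's vertical axis
clusters_h = cutree(hclust_model, h = 150)
# How many clusters did that height produce, and how big is each?
table(clusters_h)
```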
ape package
The ape package provides some great functionality for constructing and plotting clusters:
library(ape)
# various plots
plot(as.phylo(hclust_model))
plot(as.phylo(hclust_model), type = "cladogram")
plot(as.phylo(hclust_model), type = "unrooted")
# radial plot
colors = c("red", "orange", "blue", "green", "purple")
clus5 = cutree(hclust_model, 5)
plot(as.phylo(hclust_model), type = "fan", tip.color = colors[clus5], lwd = 2, cex = 1)
NOTE: the color settings for the radial plot apply to the other ape plots as well.
pvclust package
The pvclust package offers a straightforward way to perform hierarchical agglomerative clustering of columns (features), with two types of p-values reported at each split: approximately unbiased (AU) and bootstrap probability (BP).
library(pvclust)
# Cluster features
# Ward's method: minimum variance between clusters
system.time({
pvclust_model_ward = pvclust(mtcars,
method.hclust = "ward.D",
method.dist = "euclidean",
nboot = 1000, parallel = TRUE)
})
## Creating a temporary cluster...done:
## socket cluster with 7 nodes on host 'localhost'
## Multiscale bootstrap... Done.
## user system elapsed
## 0.030 0.014 4.312
plot(pvclust_model_ward)
# pvrect draws rectangles around clusters whose AU p-value exceeds alpha
pvrect(pvclust_model_ward, alpha = 0.95)
# Complete linkage: largest intercluster difference
system.time({
pvclust_model_complete = pvclust(mtcars,
method.hclust = "complete",
method.dist = "euclidean",
nboot = 1000, parallel = TRUE)
})
## Creating a temporary cluster...done:
## socket cluster with 7 nodes on host 'localhost'
## Multiscale bootstrap... Done.
## user system elapsed
## 0.023 0.013 4.234
# Single linkage: smallest intercluster difference
system.time({
pvclust_model_single = pvclust(mtcars,
method.hclust = "single",
method.dist = "euclidean",
nboot = 1000, parallel = TRUE)
})
## Creating a temporary cluster...done:
## socket cluster with 7 nodes on host 'localhost'
## Multiscale bootstrap... Done.
## user system elapsed
## 0.024 0.012 4.128
# Average linkage: mean intercluster difference
system.time({
pvclust_model_average = pvclust(mtcars,
method.hclust = "average",
method.dist = "euclidean",
nboot = 1000, parallel = TRUE)
})
## Creating a temporary cluster...done:
## socket cluster with 7 nodes on host 'localhost'
## Multiscale bootstrap... Done.
## user system elapsed
## 0.024 0.014 4.118
# View summaries
pvclust_model_ward
##
## Cluster method: ward.D
## Distance : euclidean
##
## Estimates on edges:
##
## au bp se.au se.bp v c pchi
## 1 1.000 1.000 0.000 0.000 0.000 0.000 0.000
## 2 1.000 1.000 0.000 0.000 0.000 0.000 0.000
## 3 0.791 0.837 0.026 0.004 -0.895 -0.086 0.221
## 4 0.992 0.998 0.009 0.001 -2.667 -0.257 0.705
## 5 0.953 0.999 0.168 0.001 -2.476 -0.797 1.000
## 6 1.000 1.000 0.000 0.000 0.000 0.000 0.000
## 7 1.000 1.000 0.000 0.000 0.000 0.000 0.000
## 8 1.000 1.000 0.000 0.000 0.000 0.000 0.000
## 9 1.000 1.000 0.000 0.000 0.000 0.000 0.000
## 10 1.000 1.000 0.000 0.000 0.000 0.000 0.000
pvclust_model_complete
##
## Cluster method: complete
## Distance : euclidean
##
## Estimates on edges:
##
## au bp se.au se.bp v c pchi
## 1 1.000 1.000 0.000 0.000 0.000 0.000 0.000
## 2 1.000 1.000 0.000 0.000 0.000 0.000 0.000
## 3 0.595 0.604 0.030 0.005 -0.251 -0.012 0.545
## 4 1.000 1.000 0.001 0.000 -3.331 -0.016 0.738
## 5 0.932 0.836 0.012 0.004 -1.233 0.256 0.216
## 6 1.000 1.000 0.000 0.000 0.000 0.000 0.000
## 7 1.000 1.000 0.000 0.000 0.000 0.000 0.000
## 8 1.000 1.000 0.000 0.000 0.000 0.000 0.000
## 9 0.976 0.999 0.040 0.001 -2.476 -0.504 0.842
## 10 1.000 1.000 0.000 0.000 0.000 0.000 0.000
pvclust_model_single
##
## Cluster method: single
## Distance : euclidean
##
## Estimates on edges:
##
## au bp se.au se.bp v c pchi
## 1 1.000 1.000 0.000 0.000 0.000 0.000 0.000
## 2 1.000 1.000 0.000 0.000 0.000 0.000 0.000
## 3 0.639 0.591 0.029 0.005 -0.293 0.064 0.558
## 4 0.993 1.000 0.024 0.000 -2.916 -0.466 0.560
## 5 0.852 0.926 0.026 0.003 -1.245 -0.199 0.334
## 6 1.000 1.000 0.000 0.000 0.000 0.000 0.000
## 7 1.000 1.000 0.000 0.000 0.000 0.000 0.000
## 8 1.000 1.000 0.000 0.000 0.000 0.000 0.000
## 9 0.969 0.973 0.011 0.002 -1.899 -0.033 0.158
## 10 1.000 1.000 0.000 0.000 0.000 0.000 0.000
pvclust_model_average
##
## Cluster method: average
## Distance : euclidean
##
## Estimates on edges:
##
## au bp se.au se.bp v c pchi
## 1 1.000 1.000 0.000 0.000 0.000 0.000 0.000
## 2 1.000 1.000 0.000 0.000 0.000 0.000 0.000
## 3 0.534 0.491 0.031 0.005 -0.032 0.054 0.561
## 4 0.911 0.999 0.171 0.001 -2.227 -0.880 0.673
## 5 0.885 0.864 0.018 0.004 -1.151 0.052 0.397
## 6 1.000 1.000 0.000 0.000 0.000 0.000 0.000
## 7 1.000 1.000 0.000 0.000 0.000 0.000 0.000
## 8 1.000 1.000 0.000 0.000 0.000 0.000 0.000
## 9 0.994 0.997 0.005 0.001 -2.625 -0.086 0.888
## 10 1.000 1.000 0.000 0.000 0.000 0.000 0.000
# Plot Euclidean distance linkages
par(mfrow = c(2,2))
plot(pvclust_model_ward, main = "Ward", xlab = "", sub = "")
pvrect(pvclust_model_ward)
plot(pvclust_model_complete, main = "Complete", xlab = "", sub = "")
pvrect(pvclust_model_complete)
plot(pvclust_model_single, main = "Single", xlab = "", sub = "")
pvrect(pvclust_model_single)
plot(pvclust_model_average, main = "Average", xlab = "", sub = "")
pvrect(pvclust_model_average)
par(mfrow = c(1,1))
par(mfrow=c(2,2))
seplot(pvclust_model_ward, main = "Ward")
seplot(pvclust_model_complete, main = "Complete")
seplot(pvclust_model_single, main = "Single")
seplot(pvclust_model_average, main = "Average")
par(mfrow=c(1,1))
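Rather than reading the rectangles off the plots, pvclust's pvpick() returns the clusters whose AU p-values exceed alpha programmatically (shown here for the Ward fit; the same 0.95 threshold pvrect uses by default):

```r
# Extract clusters with AU p-value > 0.95 from the Ward-linkage fit
significant = pvpick(pvclust_model_ward, alpha = 0.95)
# A list of the member features in each significant cluster
significant$clusters
```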
mclust package
The mclust package provides “Gaussian finite mixture models fitted via EM algorithm for model-based clustering, classification, and density estimation, including Bayesian regularization, dimension reduction for visualisation, and resampling-based inference.”
library(mclust)
# Fit model
mclust_model = Mclust(mtcars)
# View various plots
plot(mclust_model, what = "BIC")
plot(mclust_model, what = "classification")
plot(mclust_model, what = "uncertainty")
plot(mclust_model, what = "density")
summary(mclust_model)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VEI (diagonal, equal shape) model with 8 components:
##
## log.likelihood n df BIC ICL
## -358.1832 32 113 -1107.995 -1107.995
##
## Clustering table:
## 1 2 3 4 5 6 7 8
## 2 3 5 7 8 2 3 2
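The fitted Mclust object also stores the assignments themselves: hard labels in $classification (the clustering table above is just a tabulation of these) and soft posterior probabilities in $z, whose rows each sum to 1:

```r
# Hard cluster assignments, one label per car
head(mclust_model$classification)
# The clustering table from the summary is a tabulation of these labels
table(mclust_model$classification)
# Soft assignments: one row of posterior probabilities per car
head(rowSums(mclust_model$z))  # each row sums to 1
```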
# sort mpg in decreasing order
mtcars = mtcars[order(-mtcars$mpg),]
mtcars
## mpg cyl disp hp drat wt qsec vs am gear carb
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
# create a binary factor variable from mpg: (0, 20] = "less than 20mpg", (20, 40] = "greater than 20mpg"
mtcars$class = cut(mtcars$mpg,
breaks = c(0, 20, 40),
labels = c("less than 20mpg", "greater than 20mpg"))
mtcars
## mpg cyl disp hp drat wt qsec vs am gear carb class
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 greater than 20mpg
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 greater than 20mpg
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 greater than 20mpg
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 greater than 20mpg
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 greater than 20mpg
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 greater than 20mpg
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 greater than 20mpg
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 greater than 20mpg
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 greater than 20mpg
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 greater than 20mpg
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 greater than 20mpg
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 greater than 20mpg
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 greater than 20mpg
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 greater than 20mpg
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 less than 20mpg
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 less than 20mpg
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 less than 20mpg
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 less than 20mpg
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 less than 20mpg
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 less than 20mpg
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 less than 20mpg
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 less than 20mpg
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 less than 20mpg
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 less than 20mpg
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 less than 20mpg
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 less than 20mpg
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 less than 20mpg
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 less than 20mpg
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 less than 20mpg
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 less than 20mpg
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 less than 20mpg
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 less than 20mpg
# define our predictors (X) and class labels (class)
X = mtcars[ , -12]
class = mtcars$class
# fit the model (EDDA with a shared EEE covariance structure is equivalent to linear discriminant analysis)
mclust_model2 = MclustDA(X, class = class, modelType = "EDDA", modelNames = "EEE")
# cross-validate!
set.seed(1)
cv_mclust = cvMclustDA(mclust_model2, nfold = 20)
# View cross-validation error and standard error of the cv error
cv_mclust[c("error", "se")]
## $error
## [1] 0.09375
##
## $se
## [1] 0.04519385
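For comparison with the cross-validated error, the fitted MclustDA model can also be applied back to the training data with predict() (a resubstitution check, so expect it to be optimistic relative to the CV error above):

```r
# Resubstitution predictions on the training predictors
preds = predict(mclust_model2, newdata = X)
# Training (resubstitution) error rate
mean(preds$classification != class)
```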
References and resources:
- Quick-R: Cluster Analysis
- James et al., An Introduction to Statistical Learning, pp. 390-401
- pvclust
- STHDA: Beautiful dendrogram visualizations
- Gaston Sanchez: Visualizing Dendrograms in R
- Analysis of Phylogenetics and Evolution
- A Quick Tour of mclust
- mclust vignette (from 2012, but more detailed)
- A very useful walkthrough by Christian Lopez
- MoEClust: Gaussian Parsimonious Clustering Models with Gating and Expert Network Covariates
- See the cluster R package to learn more about agnes, clara, daisy, diana, fanny, flower, mona, and pam cluster methods!